{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 验证并导入用户-项目交互数据  <a class=\"anchor\" id=\"top\"></a>\n",
    "\n",
    "在此笔记本中，您将选择一个数据集并准备将其与Amazon Personalize一起使用。 \n",
    "\n",
    "1. [介绍](#intro)\n",
    "1. [选择数据集或数据源](#source)\n",
    "1. [准备数据](#prepare)\n",
    "1. [创建数据集组和交互数据集](#group_dataset)\n",
    "1. [配置S3存储桶和IAM角色](#bucket_role)\n",
    "1. [导入交互数据](#import)\n",
    "\n",
    "## 介绍 <a class=\"anchor\" id=\"intro\"></a>\n",
    "\n",
    "在大多数情况下，Amazon Personalize中的算法（称为recipe食谱）可以解决不同的任务，在此进行解释： \n",
    "\n",
    "1. **用户个性化** - 用户个性化需求的新版本，将在本实验中使用。\n",
    "1. **HRNN和HRNN-元数据** - 根据先前用户与项目的交互来推荐项目。\n",
    "1. **HRNN-Coldstart** - 推荐尚无交互数据的新项目。\n",
    "1. **个性化排序** - 收集项目，然后使用类似HRNN的方法按可能感兴趣的顺序对其进行排序。\n",
    "1. **SIMS（类似项目）** - 给定一个项目，推荐用户可能也与之交互的其他项目。\n",
    "1. **人气计数** - 如果HRNN或HRNN-Metadata没有结果，则推荐最受欢迎的商品--默认情况下返回。 \n",
    "\n",
    "无论用例如何，算法都共享基于3个核心属性定义的用户项交互数据的学习基础：\n",
    "\n",
    "1. **UserID** - 进行交互的用户\n",
    "1. **ItemID** - 与用户互动的项目\n",
    "1. **时间戳** - 互动发生的时间 \n",
    "\n",
    "我们还支持通过以下方式定义的事件类型和事件值：\n",
    "\n",
    "1. **事件类型** - 事件的分类标签（浏览，购买，评分等）。\n",
    "1. **事件值** - 与发生的事件类型相对应的值。 一般而言，我们在事件类型上寻找介于0和1之间的标准化值。 例如，如果分三个阶段完成交易（单击，添加到购物车和购买），则每个阶段的event_value分别为0.33、0.66和1.0。 \n",
    "\n",
    "事件类型和事件值字段是附加数据，可用于过滤为训练个性化模型而发送的数据。 在此特定练习中，我们将没有事件类型或事件值。 \n",
    "\n",
    "## 选择数据集或数据源  <a class=\"anchor\" id=\"source\"></a>\n",
    "[Back to top](#top)\n",
    "\n",
    "如前所述，用户迭代数据是服务入门的关键。 这意味着我们需要寻找生成此类数据的用例，一些常见的示例是：\n",
    "\n",
    "1. 视频点播应用\n",
    "1. 电子商务平台\n",
    "1. 社交媒体聚合器/平台 \n",
    "\n",
    "有一些准则来确定适合个性化的问题。尽管[官方最低要求](https://docs.aws.amazon.com/personalize/latest/dg/limits.html)稍低，但我们还是建议您以以下值作为起点。\n",
    "\n",
    "* 经过身份验证的用户\n",
    "* 至少50个唯一身份用户\n",
    "* 至少100个独特物品\n",
    "* 每个用户至少2打互动\n",
    "\n",
    "在大多数情况下，这是很容易实现的，如果您属于一个类别，则通常可以通过增加另一个类别中的数字来弥补这一不足。\n",
    "\n",
    "一般来说，您的数据将不会直接可用于Personalize，需要进行一些修改才能进行正确构造。这个笔记本可以指导您完成所有这些工作。\n",
    "\n",
    "首先，我们将使用最新的MovieLens数据集，该数据集具有超过2500万次的交互和丰富的项目元数据集合，并且该数据集的版本较小，可用于缩短训练时间，且仍具有与完整数据集相同的功能。将USE_FULL_MOVIELENS设置为True以使用完整的数据集。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "USE_FULL_MOVIELENS = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，您将下载数据集，并使用以下代码将其解压缩到新文件夹中。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dir = \"poc_data\"\n",
    "!mkdir $data_dir\n",
    "\n",
    "if not USE_FULL_MOVIELENS:\n",
    "    !cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip\n",
    "    !cd $data_dir && unzip ml-latest-small.zip\n",
    "    dataset_dir = data_dir + \"/ml-latest-small/\"\n",
    "else:\n",
    "    !cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-25m.zip\n",
    "    !cd $data_dir && unzip ml-25m.zip\n",
    "    dataset_dir = data_dir + \"/ml-25m/\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "查看您下载的数据文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls $dataset_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "目前，除了我们有一些csv和自述文件外，我们知道的不多。接下来我们将输出ReadMe文件来了解更多!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize $dataset_dir/README.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从ReadMe文件中，我们可以看到文件`ratings.csv`可以用作我们的交互数据，毕竟对电影进行打分绝对是与之互动的一种形式。 该数据集还具有一些流派信息，例如电影题材数据。 在本POC中，我们将重点介绍互动和题材数据。 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备数据 <a class=\"anchor\" id=\"prepare\"></a>\n",
    "[Back to top](#top)\n",
    "\n",
    "下一步要做的是加载数据并确认数据处于良好状态，然后将其保存到可与Amazon Personalize一起使用的CSV中。\n",
    "\n",
    "首先，请导入数据科学中常用的Python库的集合。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from time import sleep\n",
    "import json\n",
    "from datetime import datetime\n",
    "import boto3\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，打开数据文件并查看前几行。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data = pd.read_csv(dataset_dir + '/ratings.csv')\n",
    "original_data.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这表明我们为`userId`和`movieId`提供了一个很好的值范围。 接下来要确认数据格式。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data.isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从中可以看到，数据集中总共有（25,000,095个样本（全量数据集），100836个样本（少量数据集））条目，有4列，每个单元格都以int64格式存储，但float64除外。\n",
    "\n",
    "int64格式显然适用于`userId`和`movieId`。 但是，我们需要更深入地了解数据中的时间戳。 要使用Amazon Personalize，您需要以[Unix Epoch](https://en.wikipedia.org/wiki/Unix_time)格式保存时间戳。\n",
    "\n",
    "当前，时间戳值不是人类可读的。 因此，让我们获取一个任意的时间戳值，并可以解释它。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过选择任意时间戳并将其转换为人类可读的格式，对转换后的数据集进行快速完整性检查。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "arb_time_stamp = original_data.iloc[50]['timestamp']\n",
    "print(arb_time_stamp)\n",
    "print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "该日期作为时间戳是有意义的，因此我们可以继续格式化其余数据。 请记住，我们需要的数据是用户项目互动数据，在这种情况下，它们是`userId`，`movieId`和`timestamp`。 我们的数据集还有一个额外的列`rating`，可以在我们利用它来关注积极互动时将其从数据集中删除。 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于这是一个明确的反馈电影评分数据集，因此它包含评分为1到5的电影，因此我们只希望包含用户“喜欢”的动作，并模拟一个隐式数据集，该数据集更像是由用户收集的数据。 VOD平台。 为此，我们将过滤掉5中2分以下的所有互动，并创建`click`的`EVENT_Type`和` watch`的`EVENT_Type`。 然后，我们将所有评分2及以上的电影分配为“点击”，将评分4及以上的电影分配为“点击”和“观看”。\n",
    "\n",
    "请注意，这与我们正在建模的事件相对应，对于实际数据集，您实际上将基于隐式反馈（例如点击，观看和/或显式反馈（例如评分，喜欢）等）进行建模。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "watched_df = original_data.copy()\n",
    "watched_df = watched_df[watched_df['rating'] > 3]\n",
    "watched_df = watched_df[['userId', 'movieId', 'timestamp']]\n",
    "watched_df['EVENT_TYPE']='watch'\n",
    "watched_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clicked_df = original_data.copy()\n",
    "clicked_df = clicked_df[clicked_df['rating'] > 1]\n",
    "clicked_df = clicked_df[['userId', 'movieId', 'timestamp']]\n",
    "clicked_df['EVENT_TYPE']='click'\n",
    "clicked_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_df = clicked_df.copy()\n",
    "interactions_df = interactions_df.append(watched_df)\n",
    "interactions_df.sort_values(\"timestamp\", axis = 0, ascending = True, \n",
    "                 inplace = True, na_position ='last') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "让我们看一下新数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "处理完数据后，请务必确认数据格式是否已更改。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Amazon Personalize具有用户，项目和时间戳记的默认列名称。 这些默认列名称为`USER_ID`，`ITEM_ID`和`TIMESTAMP`。 因此，对数据集的最后修改是将现有列标题替换为默认标题。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_df.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', \n",
    "                              'timestamp':'TIMESTAMP'}, inplace = True) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "至此，数据已准备就绪，我们只需要将其另存为CSV文件即可。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_filename = \"interactions.csv\"\n",
    "interactions_df.to_csv((data_dir+\"/\"+interactions_filename), index=False, float_format='%.0f')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 创建数据集组和交互数据集 <a class=\"anchor\" id=\"group_dataset\"></a>\n",
    "[Back to top](#top)\n",
    "\n",
    "Amazon Personalize的最高级别的隔离和抽象是*数据集组*。 存储在这些数据集组之一中的信息不会影响任何其他数据集组或由一个数据集创建的模型-它们是完全隔离的。 这使您可以进行许多实验，并且是我们保持模型私有和仅对您自己的数据进行模型训练。\n",
    "\n",
    "在导入之前准备的数据之前，需要有一个数据集组和一个添加到其中的数据集来处理交互。\n",
    "\n",
    "数据集组可以容纳以下类型的信息：\n",
    "\n",
    "* 用户项目互动\n",
    "* 事件流（实时交互）\n",
    "* 用户元数据\n",
    "* 项目元数据\n",
    "\n",
    "在创建数据集组和用于交互数据的数据集之前，让我们验证您的环境是否可以与Amazon Personalize成功通信。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configure the SDK to Personalize:\n",
    "personalize = boto3.client('personalize')\n",
    "personalize_runtime = boto3.client('personalize-runtime')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 创建数据集组\n",
    "\n",
    "以下单元格将创建一个名为`personalize-poc-lastfm`的新数据集组。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_dataset_group_response = personalize.create_dataset_group(\n",
    "    name = \"personalize-poc-movielens\"\n",
    ")\n",
    "\n",
    "dataset_group_arn = create_dataset_group_response['datasetGroupArn']\n",
    "print(json.dumps(create_dataset_group_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在使用数据集组之前，它必须处于活动状态。 这可能需要一两分钟。 执行下面的单元格，然后等待其显示“活动”状态。 它每60秒检查一次数据集组的状态，最多3小时。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_time = time.time() + 3*60*60 # 3 hours\n",
    "while time.time() < max_time:\n",
    "    describe_dataset_group_response = personalize.describe_dataset_group(\n",
    "        datasetGroupArn = dataset_group_arn\n",
    "    )\n",
    "    status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n",
    "    print(\"DatasetGroup: {}\".format(status))\n",
    "    \n",
    "    if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n",
    "        break\n",
    "        \n",
    "    time.sleep(60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在您有了一个数据集组，您可以为交互数据创建一个数据集。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 创建数据集\n",
    "\n",
    "首先，定义一个Schema以告知Amazon Personalize您要上传的数据集类型。 根据数据集的类型，架构中需要几个保留关键字和强制关键字。 可以在[文档](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html)中找到更多详细信息。\n",
    "\n",
    "在这里，您将为交互数据创建一个Schema，该模式需要`USER_ID`，`ITEM_ID`和`TIMESTAMP`字段。 必须按照与数据集中显示的顺序相同的顺序在架构中对其进行定义。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_schema = schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Interactions\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"USER_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"ITEM_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"EVENT_TYPE\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"TIMESTAMP\",\n",
    "            \"type\": \"long\"\n",
    "        }\n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "\n",
    "create_schema_response = personalize.create_schema(\n",
    "    name = \"personalize-poc-movielens-interactions\",\n",
    "    schema = json.dumps(interactions_schema)\n",
    ")\n",
    "\n",
    "interaction_schema_arn = create_schema_response['schemaArn']\n",
    "print(json.dumps(create_schema_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "创建Schema后，您可以在数据集组中创建一个数据集。 请注意，这尚未加载数据。 这将在以后的几个步骤中发生。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_type = \"INTERACTIONS\"\n",
    "create_dataset_response = personalize.create_dataset(\n",
    "    name = \"personalize-poc-movielens-ints\",\n",
    "    datasetType = dataset_type,\n",
    "    datasetGroupArn = dataset_group_arn,\n",
    "    schemaArn = interaction_schema_arn\n",
    ")\n",
    "\n",
    "interactions_dataset_arn = create_dataset_response['datasetArn']\n",
    "print(json.dumps(create_dataset_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 配置S3存储桶和IAM角色  <a class=\"anchor\" id=\"bucket_role\"></a>\n",
    "[Back to top](#top)\n",
    "\n",
    "到目前为止，我们已经下载，处理了数据并将其保存到连接到运行此Jupyter笔记本的实例的磁盘上。 但是，Amazon Personalize将需要一个S3存储桶作为您的数据源，并需要IAM角色来访问该存储桶。 让我们进行所有设置。\n",
    "\n",
    "使用存储在此Amazon SageMaker笔记本基础实例上的元数据来确定它在其中运行的区域。如果您在Amazon SageMaker之外使用Jupyter笔记本，则只需在下面的字符串中定义区域即可。 Amazon S3存储桶必须与到目前为止我们创建的Amazon Personalize资源位于同一区域。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n",
    "    data = json.load(notebook_info)\n",
    "    resource_arn = data['ResourceArn']\n",
    "    region = resource_arn.split(':')[3]\n",
    "print(region)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Amazon S3存储桶名称在全球范围内是唯一的。 要创建唯一的存储桶名称，以下代码会将字符串`personalizepocvod`附加到您的AWS帐号上。 然后，它将在上一个单元格中发现的区域中创建一个具有此名称的存储桶。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.client('s3')\n",
    "account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
    "bucket_name = account_id + \"-\" + region + \"-\" + \"personalizepocvod\"\n",
    "print(bucket_name)\n",
    "if region == \"us-east-1\":\n",
    "    s3.create_bucket(Bucket=bucket_name)\n",
    "else:\n",
    "    s3.create_bucket(\n",
    "        Bucket=bucket_name,\n",
    "        CreateBucketConfiguration={'LocationConstraint': region}\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 将数据上传到S3\n",
    "\n",
    "现在您的Amazon S3存储桶已创建，请上传我们的用户项目互动数据的CSV文件。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_file_path = data_dir + \"/\" + interactions_filename\n",
    "boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)\n",
    "interactions_s3DataPath = \"s3://\"+bucket_name+\"/\"+interactions_filename"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 设置S3存储桶策略\n",
    "Amazon Personalize需要能够读取您的S3存储桶中的内容。 因此，添加允许该操作的存储桶策略。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "policy = {\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Id\": \"PersonalizeS3BucketAccessPolicy\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "            \"Sid\": \"PersonalizeS3BucketAccessPolicy\",\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Principal\": {\n",
    "                \"Service\": \"personalize.amazonaws.com\"\n",
    "            },\n",
    "            \"Action\": [\n",
    "                \"s3:*Object\",\n",
    "                \"s3:ListBucket\"\n",
    "            ],\n",
    "            \"Resource\": [\n",
    "                \"arn:aws-cn:s3:::{}\".format(bucket_name),\n",
    "                \"arn:aws-cn:s3:::{}/*\".format(bucket_name)\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n",
    "s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 创建一个IAM角色\n",
    "\n",
    "Amazon Personalize需要具有在AWS中担任角色的能力，以便具有执行某些任务的权限。 让我们创建一个IAM角色并将所需的策略附加到该角色。 下面的代码附有非常宽松的政策； 请对任何生产应用程序使用更多限制性政策。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iam = boto3.client(\"iam\")\n",
    "\n",
    "role_name = \"PersonalizeRolePOC\"\n",
    "assume_role_policy_document = {\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "          \"Effect\": \"Allow\",\n",
    "          \"Principal\": {\n",
    "            \"Service\": \"personalize.amazonaws.com\"\n",
    "          },\n",
    "          \"Action\": \"sts:AssumeRole\"\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n",
    "create_role_response = iam.create_role(\n",
    "    RoleName = role_name,\n",
    "    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)\n",
    ")\n",
    "\n",
    "# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes \"personalize\" or \"Personalize\" \n",
    "# if you would like to use a bucket with a different name, please consider creating and attaching a new policy\n",
    "# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role\n",
    "policy_arn = \"arn:aws-cn:iam::aws:policy/service-role/AmazonPersonalizeFullAccess\"\n",
    "iam.attach_role_policy(\n",
    "    RoleName = role_name,\n",
    "    PolicyArn = policy_arn\n",
    ")\n",
    "\n",
    "# Now add S3 support\n",
    "iam.attach_role_policy(\n",
    "    PolicyArn='arn:aws-cn:iam::aws:policy/AmazonS3FullAccess',\n",
    "    RoleName=role_name\n",
    ")\n",
    "time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate\n",
    "\n",
    "role_arn = create_role_response[\"Role\"][\"Arn\"]\n",
    "print(role_arn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 导入互动数据  <a class=\"anchor\" id=\"import\"></a>\n",
    "[Back to top](#top)\n",
    "\n",
    "早先您创建了数据集组和数据集以容纳您的信息，因此现在您将执行导入作业，该作业会将数据从S3存储桶加载到Amazon Personalize数据集中。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"personalize-poc-import1\",\n",
    "    datasetArn = interactions_dataset_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket_name, interactions_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在使用数据集之前，导入作业必须处于活动状态。 执行下面的单元格，然后等待其显示“活动”状态。 它每60秒检查一次导入作业的状态，最多6个小时。\n",
    "\n",
    "导入数据可能需要一些时间，具体取决于数据集的大小。 在本实验中，数据导入作业大约需要15分钟。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "max_time = time.time() + 6*60*60 # 6 hours\n",
    "while time.time() < max_time:\n",
    "    describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n",
    "        datasetImportJobArn = dataset_import_job_arn\n",
    "    )\n",
    "    status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n",
    "    print(\"DatasetImportJob: {}\".format(status))\n",
    "    \n",
    "    if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n",
    "        break\n",
    "        \n",
    "    time.sleep(60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据集导入任务成功后，您就可以开始使用SIMS，个性化排名，流行度计数和用户个性化来构建模型了。 此过程将在其他笔记本中继续进行。 运行下面的单元格，然后继续存储一些要在下一个笔记本中使用的值。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store USE_FULL_MOVIELENS\n",
    "%store dataset_dir\n",
    "%store interactions_dataset_arn\n",
    "%store dataset_group_arn\n",
    "%store bucket_name\n",
    "%store role_arn\n",
    "%store role_name\n",
    "%store data_dir\n",
    "%store region\n",
    "%store interaction_schema_arn"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
